Byzantine Fault Tolerance, from Theory to Reality
Authors: K. Driscoll et al.
Abstract
Since its introduction nearly 20 years ago, the Byzantine Generals Problem has been the subject of many papers and has come under the scrutiny of the fault tolerance community. Numerous Byzantine-tolerant algorithms and architectures have been proposed. However, this problem is not yet sufficiently understood by those who design, build, and maintain systems with high dependability requirements. Today, there are still many misconceptions relating to Byzantine failure, what makes a system vulnerable, and indeed the very nature and reality of Byzantine faults. This paper revisits the Byzantine problem from a practitioner’s perspective. It is intended to provide the reader with a working appreciation of Byzantine failure from a practical as well as a theoretical perspective. A discussion of typical failure properties and the difficulties in preventing the associated failure propagation is presented. These are illustrated with real Byzantine failure observations. Finally, various architectural solutions to the Byzantine problem are presented.

1 What You Thought Could Never Happen

In English, the phrase “one in a million” is popularly used to describe the highly improbable. The ratio itself is difficult to comprehend. The easiest way to make sense of it is to equate it to real-world expectations. For example, the probability of winning the U.K. National Lottery is around one in fourteen million; the probability of getting struck by lightning in the U.S. is around one in six hundred thousand [1].

It is not safe to rely on intuition for reasoning about unfathomably small probabilities (for example, the 1-in-1,000,000,000 maximum failure probability for critical aerospace systems¹). Doing so is problematic in two ways: (1) real-world parallels are beyond typical human experience and comprehension; (2) faults that are not recognized, such as Byzantine faults, are incorrectly assumed to occur with zero or very low probability.
The lack of recognition causes additional issues in that it allows the manifestation of such faults to pass unnoticed or be otherwise misclassified, reinforcing the misconception of low probability of occurrence.

¹ Usually written as a failure rate of 10^-9/hr.

236 K. Driscoll et al.

The lack of recognition leads to repeating the “Legionnaire’s Disease” phenomenon. After its “discovery” in 1976, a search of medical records found that the disease had been occurring, albeit rarely, for many decades. The lack of shared knowledge and the disease’s rarity made each occurrence appear to be unique. Only after 1976 was it realized that all these “unique” occurrences had a common cause. Similarly, an observation of a Byzantine failure will not be recognized as an instance of a known class of failure by those who are not intimately familiar with Byzantine failures.

The intent of this paper is to redress this situation. Drawing from the authors’ experiences with Byzantine failures in real-world systems, this paper shows that Byzantine problems are real, have nasty properties, and are likely to increase in frequency with emerging technology trends. Some of the myths with respect to the containment of Byzantine faults are dispelled, and suitable mitigation strategies and architectures are discussed.

2 The Byzantine Army Is Growing

The microprocessor revolution has seen electronic and software technologies proliferate into almost every domain of life, including an increasing responsibility in safety-critical domains. High assurance processes (e.g. DO-254 [2]) have matured to manage the development of high integrity systems. The cost of using high assurance processes is relatively high in comparison to equivalent levels of functionality in commercial counterparts. In recent years, there has been a push to adopt commercial off-the-shelf (COTS) technology into high integrity systems.
The timing of the COTS push is alarming, considering the decreasing reliability and dependability trends emerging within the COTS integrated circuit (IC) arena. With increasing clock frequencies, decreasing process geometries, and decreasing power supply voltages, studies [3] conclude that the dependability of modern ICs is decreasing. This development reverses the historical trend of increasing IC dependability. Bounding the failure behaviors of emerging COTS ICs will become increasingly difficult.

The expected lifetime of ICs is also decreasing, as modes of “silicon wear-out” (i.e. time-dependent dielectric breakdown, hot carrier aging, and electro-migration) become more significant. The anticipated working life of current ICs may be on the order of 5-10 years. Although viable in the commercial arena, this is a world away from the requirements for high integrity systems, which have traditionally had deployment lifetimes ranging over multiple decades. Strategies to mitigate the problems are in development [4]. It is safe to assume that the anticipated rate of IC failure will increase and the modes of failure will become ever more difficult to characterize.

Another significant trend is the move towards more distributed, safety-critical processing system topologies. These distributed architectures appear to be favored for the emerging automotive “by-wire” control systems [5], where a new breed of safety-critical communications protocols and associated COTS are being developed [6]. Such technologies, if developed in compliance with the traditional high assurance processes, show promise to answer the cost challenges of high integrity applications. However, it is imperative that such technologies are developed with full knowledge of all possible failure modes. Distributed control systems, by their very nature, require consensus among their constituent elements. The required consensus might be low-level (e.g. in the form of mutual synchronization) or at a higher level (e.g. in the form of some coordinated system action such as controlling a braking force). Addressing Byzantine faults that can disrupt consensus is thus a crucial system requirement.

We expect Byzantine faults to be of increasing concern given the two major trends described in this section: (1) Byzantine faults are more likely to occur due to trends in device physics; (2) systems are becoming more vulnerable due to increasing emphasis on safety-critical distributed topologies. It is therefore imperative that Byzantine faults and failure mechanisms become widely understood and that the design of safety-critical systems includes mitigation strategies.

3 Parables from the Classical Byzantine Age

The initial definition of Byzantine failure was given in a landmark paper by Lamport et al. [7]. They present the scenario of a group of Byzantine Generals whose divisions surround an enemy camp. After observing the enemy, the Generals must communicate amongst themselves and come to consensus on a plan of action: whether to attack or retreat. If they all attack, they win; if none attack, they live to fight another day. If only some of the generals attack, then the generals will die. The generals communicate via messages. The problem is that one or more of the generals may be traitors who send inconsistent messages to prevent the loyal generals from reaching consensus.

The original paper discusses the problem in the context of oral messages and written, signed messages, which are attributed different properties. Oral messages are characterized as follows:

A1. Every message that is sent is delivered correctly (messages are not lost).
A2. The receiver of a message knows who sent it.
A3. The absence of a message can be detected.

For messages with these properties, it has been proven that consensus cannot be achieved with three generals if one of the generals is assumed to be a traitor.
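The three-general result can be made concrete for the simplest exchange pattern. The sketch below is illustrative only and is not the proof given in [7]: it assumes each lieutenant decides from just two inputs (the order received from the commander and the report relayed by the other lieutenant), and it enumerates every deterministic decision rule over those inputs. All names (`all_rules`, `obeys_loyal_commander`, `agree_under_traitor_commander`, `working_rule_pairs`) are hypothetical.

```python
from itertools import product

ORDERS = ("attack", "retreat")
# A lieutenant's view: (order heard from the commander, report relayed by the other lieutenant).
VIEWS = list(product(ORDERS, repeat=2))


def all_rules():
    """Yield every deterministic decision rule, as a mapping view -> order."""
    for outputs in product(ORDERS, repeat=len(VIEWS)):
        yield dict(zip(VIEWS, outputs))


def obeys_loyal_commander(rule):
    """A loyal commander's order v must be followed, whatever a traitorous peer reports."""
    return all(rule[(v, x)] == v for v in ORDERS for x in ORDERS)


def agree_under_traitor_commander(rule1, rule2):
    """A traitorous commander sends a to lieutenant 1 and b to lieutenant 2.
    Both lieutenants are loyal, relay honestly, and must reach the same decision."""
    return all(rule1[(a, b)] == rule2[(b, a)] for a in ORDERS for b in ORDERS)


def working_rule_pairs():
    """Return every pair of rules that meets both requirements."""
    rules = list(all_rules())
    return [
        (r1, r2)
        for r1 in rules
        for r2 in rules
        if obeys_loyal_commander(r1)
        and obeys_loyal_commander(r2)
        and agree_under_traitor_commander(r1, r2)
    ]


print(len(working_rule_pairs()))  # prints 0
```

No pair of rules survives. Obeying a loyal commander forces each lieutenant to follow the order it received directly, and a traitorous commander who sends different orders to the two lieutenants then splits them.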
A solution is presented in which each of four generals exchanges information with his peers and a majority vote makes selections over all of the data exchanged. This solution is generalized to accommodate multiple traitors, concluding: to tolerate m traitorous generals requires 3m + 1 generals utilizing m + 1 rounds of information exchange.

Written, signed messages assume all of the properties (A1-A3) of the oral messages, and are further characterized by the properties below:

A4. A loyal general’s signature cannot be forged.
A5. Anyone can verify the authenticity of a signature.

Assuming the signed message properties above, it is shown that consensus is possible with just three generals, using a simple majority voting function. The solution is further generalized to address multiple fault scenarios, concluding: to tolerate m traitorous generals requires m + 2 generals and m + 1 rounds of information exchange.

The initial proofs presented in the paper assume that all generals communicate directly with one another. The assumptions are later relaxed to address topologies of less connectivity. It is proven that for oral messages, consensus is possible if the generals are connected in a p-regular graph, where p > 3m - 1. For signed (authenticated) messages, it is proven that consensus is possible if the loyal generals are connected to each other. However, this solution requires additional rounds of information exchange.

Since its initial presentation, nearly two decades ago, the Byzantine Generals problem has been the subject of intense academic study, leading to the development and formal validation of numerous Byzantine-tolerant algorithms and architectures. As stated previously, industry’s recognition and treatment of the problem has been far less formal and rigorous. A reason for this might be the anthropomorphic tone and presentation of the problem definition.
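The oral-message solution summarized earlier in this section (four generals for one traitor, 3m + 1 in general) can be sketched as a small simulation. This is a minimal sketch of the recursive exchange that Lamport et al. [7] call OM(m), under stated assumptions: generals are numbered integers, general 0 is the commander, RETREAT is the default when no strict majority exists, and the traitor strategy in `send` is one hypothetical choice among many. The function names are illustrative.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def send(sender, receiver, value, traitors):
    """Deliver one oral message (properties A1-A3 hold by construction).
    Assumed traitor strategy: tell odd-numbered receivers RETREAT and
    even-numbered receivers ATTACK, regardless of the real value."""
    if sender in traitors:
        return RETREAT if receiver % 2 else ATTACK
    return value


def majority(values):
    """Return the strict-majority value, or RETREAT when there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else RETREAT


def om(m, commander, lieutenants, value, traitors):
    """Return {lieutenant: the order it settles on} after the recursive exchange."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {lt: send(commander, lt, value, traitors) for lt in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j relays what it received, acting as
    # commander of a smaller exchange with m - 1 among the others.
    relayed = {}
    for j in lieutenants:
        others = [lt for lt in lieutenants if lt != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Step 3: each lieutenant votes over its own copy and the relayed copies.
    decided = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decided[i] = majority(votes)
    return decided


# Four generals, lieutenant 3 is the traitor: loyal lieutenants 1 and 2 obey.
four = om(1, 0, [1, 2, 3], ATTACK, traitors={3})
print(four[1], four[2])  # prints: attack attack

# Three generals, lieutenant 2 is the traitor: loyal lieutenant 1 is misled.
three = om(1, 0, [1, 2], ATTACK, traitors={2})
print(three[1])  # prints: retreat
```

With three generals, the loyal lieutenant sees one ATTACK and one RETREAT and cannot tell which sender lied, so it falls back to the default and disobeys a loyal commander. With four generals, the extra honest relay breaks the tie.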
Although the authors warned against adopting too literal an interpretation, much of the related literature that has followed the original text has reinforced the “traitorous” anthropomorphic failure model. Such treatment has resulted in the work being ignored by a large segment of the community. Practicing engineers, who intuitively know that processors have no volition and cannot “lie,” quickly dismiss concepts of “traitorous generals” hiding within their digital systems. Similarly, while the arguments of unforgeable signed messages make sense in the context of communicating generals, the validity of necessary assumptions in a digital processing environment is not supportable. In fact, the philosophical approach of utilizing cryptography to address the problem within the real world of digital electronics makes little sense. The assumptions required to support the validity of unbreakable signatures are equally applicable to simpler approaches (such as appending a simple source ID or a CRC to the end of a message). It is not possible to prove such assumptions analytically for systems with failure probability requirements